NSF PAR Search | NSF Public Access Repository

DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

https://doi.org/10.14778/3746405.3746426

Shankar, Shreya; Chambers, Tristan; Shah, Tarak; Parameswaran, Aditya G; Wu, Eugene (May 2025, Proceedings of the VLDB Endowment)

Analyzing unstructured data has been a persistent challenge in data processing. Recent proposals offer declarative frameworks for LLM-powered processing of unstructured data, but they typically execute user-specified operations as-is in a single LLM call, focusing on cost rather than accuracy. This is problematic for complex tasks, where even well-prompted LLMs can miss relevant information. For instance, reliably extracting all instances of a specific clause from legal documents often requires decomposing the task, the data, or both. We present DocETL, a system that optimizes complex document processing pipelines while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of LLM execution. Across four real-world document processing tasks, DocETL improves accuracy by 21–80% over strong baselines. DocETL is open-source at docetl.org and, as of March 2025, has over 1.7k GitHub stars, with users across diverse domains.
Full Text Available
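
The DocETL abstract above notes that reliably extracting all instances of a clause often requires decomposing the data. A minimal sketch of that idea follows: split a long document into overlapping chunks, run an extractor on each chunk, and merge the results. This is not DocETL's API. The names `split_into_chunks`, `find_clause_markers`, and `map_reduce_extract` are hypothetical, and the regular-expression extractor is only a stand-in for an LLM call.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int, overlap: int) -> List[str]:
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def find_clause_markers(chunk: str) -> List[str]:
    """Hypothetical stand-in for an LLM extraction call: find 'Clause-N' markers."""
    return re.findall(r"Clause-\d+", chunk)


def map_reduce_extract(
    text: str,
    extractor: Callable[[str], List[str]],
    max_words: int = 50,
    overlap: int = 10,
) -> List[str]:
    """Map the extractor over chunks, then reduce by de-duplicating in first-seen order."""
    seen = {}
    for chunk in split_into_chunks(text, max_words, overlap):
        for item in extractor(chunk):
            seen.setdefault(item, None)
    return list(seen)
```

Overlapping windows mean that an item near a chunk boundary appears in at least one chunk in full, and the reduce step removes the duplicates that the overlap introduces.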
Querying Templatized Document Collections with Large Language Models

https://doi.org/10.1109/ICDE65448.2025.00183

Lin, Yiming; Hulsebos, Madelon; Ma, Ruiying; Shankar, Shreya; Zeighami, Sepanta; Parameswaran, Aditya G; Wu, Eugene (May 2025, IEEE)

Full Text Available
SET: Searching Effective Supervised Learning Augmentations in Large Tabular Data Repositories

Liu, Jiaxiang; Huang, Zezhou; Wu, Eugene (June 2024, GUIDE-AI Workshop)

Successful supervised learning models rely on predictive features, which rarely come from a single dataset. As a result, relevant datasets need to be integrated before training the actual model. This raises one natural question: "how can one efficiently search for predictive features from relevant datasets for integration with responsible AI guarantees?". This paper formalizes the question as the data augmentation search problem with an objective of minimizing the search latency. We propose SET, an interactive system that intakes a supervised learning task and searches for a set of join-compatible datasets that optimally improve the performance of the task. Specifically, SET manages a corpus of relational datasets, uses linear regression as a proxy model to evaluate augmentation candidates, and applies factorized machine learning to accelerate model training and evaluation algorithmically. Furthermore, SET leverages system and hardware optimizations to maximize parallelism across augmentation searches. These allow SET to search for a good augmentation plan over 1 million datasets with a latency of 1.4 seconds.
Full Text Available
Physical Visualization Design: Decoupling Interface and System Design

https://doi.org/10.1145/3725334

Chen, Yiru; Li, Xupeng; Tao, Jeffrey; Ramjit, Lana; Mitra, Subrata; Ghaderi, Javad; Netravali, Ravi; Parameswaran, Aditya; Rubenstein, Dan; Wu, Eugene (June 2025, Proceedings of the ACM on Management of Data)

Interactive visualization interfaces enable users to efficiently explore, analyze, and make sense of their datasets. However, as data grows in size, it becomes increasingly challenging to build data interfaces that meet the interface designer's desired latency expectations and resource constraints. Cloud DBMSs, while optimized for scalability, often fail to meet latency expectations, necessitating complex, bespoke query execution and optimization techniques for data interfaces. This involves manually navigating a huge optimization space that is sensitive to interface design and resource constraints, such as client vs server data and compute placement, choosing which computations are done offline vs online, and selecting from a large library of visualization-optimized data structures. This paper advocates for a Physical Visualization Design (PVD) tool that decouples interface design from system design to provide design independence. Given an interface's underlying data flow, interactions with latency expectations, and resource constraints, PVD checks if the interface is feasible and, if so, proposes and instantiates a middleware architecture spanning the client, server, and cloud DBMS that meets the expectations. To this end, this paper presents Jade, the first prototype PVD tool that enables design independence. Jade proposes an intermediate representation called Diffplans to represent the data flows, develops cost estimation models that trade off between latency guarantees and plan feasibility, and implements an optimization framework to search for the middleware architecture that meets the guarantees. We evaluate Jade on six representative data interfaces as compared to Mosaic and Azure SQL Database. We find Jade supports a wider range of interfaces, makes better use of available resources, and can meet a wider range of data, latency, and resource conditions.
Full Text Available
Lightweight Materialization for Fast Dashboards Over Joins

https://doi.org/10.1145/3626735

Huang, Zezhou; Wu, Eugene (December 2023, Proceedings of the ACM on Management of Data)

Dashboards are vital in modern business intelligence tools, providing non-technical users with an interface to access comprehensive business data. With the rise of cloud technology, there is an increased number of data sources to provide enriched contexts for various analytical tasks, leading to a demand for interactive dashboards over a large number of joins. Nevertheless, joins are among the most expensive operations in DBMSes, making the support of interactive dashboards over joins challenging. In this paper, we present Treant, a dashboard accelerator for queries over large joins. Treant uses factorized query execution to handle aggregation queries over large joins, which alone is still insufficient for interactive speeds. To address this, we exploit the incremental nature of user interactions using Calibrated Junction Hypertree (CJT), a novel data structure that applies lightweight materialization of the intermediates during factorized execution. CJT ensures that the work needed to compute a query is proportional to how different it is from the previous query, rather than the overall complexity. Treant manages CJTs to share work between queries and performs materialization offline or during user think-times. Implemented as a middleware that rewrites SQL, Treant is portable to any SQL-based DBMS. Our experiments on single-node and cloud DBMSes show that Treant improves dashboard interactions by two orders of magnitude, and provides a 10x improvement for ML augmentation compared to a state-of-the-art factorized ML system.
Full Text Available
The Fast and the Private: Task-based Dataset Search

Huang, Zezhou; Liu, Jiaxiang; Wang, Haonan; Wu, Eugene (January 2024, Conference on Innovative Data Systems Research)

Recent platforms utilize ML task performance metrics, not metadata keywords, to search large data corpora. Requesters provide an initial dataset, and the platform searches for additional datasets that augment (by join or union) the requester's dataset to most improve the performance of the model (e.g., linear regression). Although effective, current task-based data searches are stymied by (1) high latency which deters users, (2) privacy concerns for regulatory standards, and (3) low data quality which provides low utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. Based on semi-ring, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate the early promise in using LLM-based agents for automatic data transformation and applying semi-rings to support causal discovery and treatment effect estimation.
Full Text Available
DIG: The Data Interface Grammar

https://doi.org/10.1145/3597465.3605223

Chen, Yiru; Tao, Jeffrey; Wu, Eugene (June 2023, ACM)

Building interactive data interfaces is hard because the design of an interface depends on the data processing needs for the underlying analysis task, yet we do not have a good representation for analysis tasks. To fill this gap, this paper advocates for a Data Interface Grammar (DIG) as an intermediate representation of analysis tasks. We show that DIG is compatible with existing data engineering practices, compact to represent any analysis, simple to translate into an interface design, and amenable to offline analysis. We further illustrate the potential benefits of this abstraction, such as automatic interface generation, automatic interface backend optimization, tutorial generation, and workload generation.
Full Text Available
JoinBoost: Grow Trees over Normalized Data Using Only SQL

https://doi.org/10.14778/3611479.3611509

Huang, Zezhou; Sen, Rathijit; Liu, Jiaxiang; Wu, Eugene (July 2023, Proceedings of the VLDB Endowment)

Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries... with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting, by updating the Y variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring to support rmse, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
Full Text Available
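
The JoinBoost abstract above mentions the variance semi-ring and aggregation over a non-materialized join. As a rough illustration, the sketch below uses the (count, sum, sum of squares) triple commonly described in the factorized learning literature and computes the variance of Y over a two-table join without building the join. It is an assumption that this matches the paper's formulation. The function names and the toy relations `R(key, y)` and `S(key, attr)` are hypothetical and are not JoinBoost's API.

```python
from typing import Dict, Hashable, List, Tuple

Triple = Tuple[float, float, float]  # (count, sum of y, sum of y squared)

ZERO: Triple = (0.0, 0.0, 0.0)  # additive identity
ONE: Triple = (1.0, 0.0, 0.0)   # annotation for a tuple that carries no y value


def add(a: Triple, b: Triple) -> Triple:
    """Semi-ring addition: combine aggregates of disjoint groups of tuples."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mul(a: Triple, b: Triple) -> Triple:
    """Semi-ring multiplication: combine aggregates of two sides of a join."""
    c1, s1, q1 = a
    c2, s2, q2 = b
    return (c1 * c2, s1 * c2 + s2 * c1, q1 * c2 + q2 * c1 + 2 * s1 * s2)


def factorized_variance(
    r_rows: List[Tuple[Hashable, float]],   # R(key, y)
    s_rows: List[Tuple[Hashable, object]],  # S(key, attr)
) -> float:
    """Variance of y over R JOIN S on key, computed from per-key aggregates only."""
    r_agg: Dict[Hashable, Triple] = {}
    for key, y in r_rows:
        r_agg[key] = add(r_agg.get(key, ZERO), (1.0, y, y * y))
    s_agg: Dict[Hashable, Triple] = {}
    for key, _ in s_rows:
        s_agg[key] = add(s_agg.get(key, ZERO), ONE)

    total = ZERO
    for key, r_triple in r_agg.items():
        if key in s_agg:  # keys missing on either side produce no join rows
            total = add(total, mul(r_triple, s_agg[key]))

    count, y_sum, y_sq_sum = total
    if count == 0:
        raise ValueError("join result is empty")
    mean = y_sum / count
    return y_sq_sum / count - mean * mean


def materialized_variance(r_rows, s_rows) -> float:
    """Reference implementation that builds the join result explicitly."""
    ys = [y for rk, y in r_rows for sk, _ in s_rows if rk == sk]
    mean = sum(ys) / len(ys)
    return sum(y * y for y in ys) / len(ys) - mean * mean
```

The factorized version touches each input row once and combines one triple per join key, whereas the reference version enumerates every joined row.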
Teaching Data Science by Visualizing Data Table Transformations: Pandas Tutor for Python, Tidy Data Tutor for R, and SQL Tutor

https://doi.org/10.1145/3596673.3596972

Lau, Sam; Kross, Sean; Wu, Eugene; Guo, Philip J. (June 2023, ACM)

Full Text Available
